- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 19.2k
CLN: Consolidate decimal string parsing functions in tokenizer.c #62823
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Remove heap allocation (malloc/free) in round_trip by using stack-allocated buffer with PROCESSED_WORD_CAPACITY. This improves performance and simplifies memory management.
| } | ||
| p = buffer; | ||
| char *endptr; | ||
| int status = _str_copy_decimal_str_c(buffer, PROCESSED_WORD_CAPACITY, p_item, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, in the integer parsing functions, it only processed the word if it identified the necessity for it by checking for the presence of tsep. Now you are always doing it. For me, this doesn't seem necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the performance of code you said would be better.
However, even if there is no tsep, processing such as skipping whitespaces is necessary, so various string processing logic before and after must be reverted as before.
If I have to, I also think about that maybe it would be better to use copy_string_without_char like before.
So instead of sacrificing a little bit of performance, I think it is reasonable the current version of consolidating and simplifying logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, I'm not an expert, so I’d appreciate your advice. If you still think it’s better to make the change even after considering what I said, I’ll follow your recommendation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, even if there is no
tsep, processing such as skipping whitespaces is necessary,
I honestly don't see much problem in having the repeated logic of dealing with leading and trailing whitespace as it was before.
If I have to, I also think about that maybe it would be better to use
copy_string_without_charlike before.
Considering the goal of the issue was to create a superset of this function, I think it's better to use the function that you are rewriting.
So instead of sacrificing a little bit of performance, I think it is reasonable the current version of consolidating and simplifying logic.
This is a valid point. I am ambivalent about this. I would wait for @WillAyd opinion on this topic.
| #define SKIP_NSPAN(s, n, charset) \ | ||
| str_consume_nspan(NULL, 0, &(s), (n), (charset)) | ||
|  | ||
| #define SAFE_CONSUME_SPAN(d, de, s, charset) \ | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This overall seems pretty complicated - what didn't work with strtok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If _str_copy_decimal_str_c only needed to skip whitespaces, I would have used strtok.
However, the function’s requirements are more complex:
- it needs to skip certain character sets selectively,
- handle different token types (tsep,decimal,digits), and
- avoid modifying the source string (since strtokremoves delimiters).
I initially experimented with using strtok just for whitespace skipping, but it only made the code deeper and harder to follow.
Using it to simplify more complex parsing turned out to be difficult — at least with my current approach.
Also, because strtok modifies the input string, I would have needed to make a copy first to avoid side effects.
(I’m not entirely sure how critical those side effects are in this context, but I chose to take the conservative route.)
For these reasons, I decided to use strspn and memcpy instead.
Up until commit 7b0be60, the code remained relatively simple, but after adding repetitive safety checks (such as buffer overflow prevention), it evolved into its current form.
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.Description
This PR consolidates two similar string manipulation functions in
tokenizer.c:_str_copy_decimal_str_c: handled decimal/thousands separator replacement with heap allocationcopy_string_without_char: removed characters with caller-provided bufferThe refactored
_str_copy_decimal_str_cnow combines the strengths of both approaches:strspn,memcpy) instead of byte-by-byte processingChanges
_str_copy_decimal_str_cto write to caller-provided buffer instead of heap allocationstr_consume_span, etc.) for efficient string parsingcopy_string_without_charfunctionstr_to_int64andstr_to_uint64to use the refactored functionround_tripfunction with stack buffer allocation